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Abstract 

Detecting the emergence of an abrupt change-point is a classic problem in statistics and 
machine learning. Kernel-based nonparametric statistics have been proposed for this task 
which make fewer assumptions on the distributions than traditional parametric approach. 
However, none of the existing kernel statistics has provided a computationally efficient way 
to characterize the extremal behavior of the statistic. Such characterization is crucial for 
setting the detection threshold, to control the significance level in the offline case as well as 
the false alarm rate (captured by the average run length) in the online case. In this paper 
we focus on the scenario when the amount of background data is large, and propose two 
related computationally efficient kernel-based statistics for change-point detection, which 
we call “M-statistics”. A novel theoretical result of the paper is the characterization of the 
tail probability of these statistics using a new technique based on change-of-measure. Such 
characterization provides us accurate detection thresholds for both offline and online cases 
in computationally efficient manner, without the need to resort to the more expensive 
simulations such as bootstrapping. Moreover, our M-statistic can be applied to high¬ 
dimensional data by choosing a proper kernel. We show that our methods perform well in 
both synthetic and real world data. 

Keywords: Change-Point Detection, Kernel-Based Tests, Online Algorithm, Tail Prob¬ 
ability, False-Alarm Control. 

1. Introduction 

Detecting emergence of an abrupt change is a fundamental problem in statistics and machine 
learning. Given a sequence of samples, xi,X2, ■ ■ ■ ,xt from a domain X, we are interested in 
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detecting a possible change-point r, such that before the change samples Xi are i.i.d. sampled 
from a null distribution P, and after the change samples Xi are i.i.d. from a distribution Q. 
Here, the time horizon t is either fixed t = To (in the offline or fixed-sample setting), or t 
is not fixed (in the online or sequential setting) since we are getting new samples. In the 
offline setting, our goal is to detect the existence, and in the online setting, our goal is to 
detect the emergence of a change-point as soon as possible after it occurs. Here, we restrict 
our attention to detecting one change-point, which is the case of monitoring problems. One 
such instance is seismic event detection (Ross and Ben-Zion, 2014), where we would like to 
either detect the presence of a weak event in retrospect to better understand geophysical 
structure, or detect the event as quickly as possible in the online monitoring setting. 

Change-point detection problems are related to the classical statistical two-sample test; 
however, they are usually more difficult in that for change-point detection, we need to search 
for the unknown change-point location r. For instance, in the offline case, this corresponds 
to taking a maximum of a series of statistics each corresponding to one putative change- 
point location (a similar idea was used in (Harchaoui et ah, 2008) for the offline case); and 
in the online case, we have to characterize the average run length of the test statistic hitting 
the threshold, which necessarily results in taking a maximum of the statistics over time. 
Moreover, the statistics being maxed over are usually highly correlated. Hence, analyzing 
the tail probabilities of the test statistic for change-point detection typically requires more 
sophisticated probabilistic tools. 

Ideally, the detection algorithm should also be free of distributional assumptions to have 
robust detection. However, classic approaches for change-point detection are usually para¬ 
metric, meaning that they rely on strong assumptions on the distribution. Nonparametric 
and kernel approaches are distribution free and more robust as they provide consistent re¬ 
sults over larger classes of data distributions (they can possibly be less powerful in settings 
where a clear distributional assumption can be made). A classic non-parametric scheme for 
change-point detection is (Gordon and Poliak, 1994), which is based on a likelihood ratio 
formed for a sequence of vectors of signs and ranks of the scalar observations. However, it is 
not suitable for vector observations, since it needs to order the observations. Recently many 
kernel based statistics have been proposed in the machine learning literature (Harchaoui 
et ah, 2008; Enikeeva and Harchaoui, 2014; Zou et ah, 2014; Kifer et ah, 2004; Song et ah, 
2013; Desobry et ah, 2005), which can be applied to vector observations and typically work 
better in real data with few distributional assumptions. However, none of these existing 
kernel statistics has provided a computationally efficient way to characterize the tail prob¬ 
ability of the extremal value of these statistics. Characterization such tail probability is 
crucial for setting the correct detection thresholds for both the offline and online cases. 

Furthermore, typically we have a large amount of background data in the change point 
detection setting {e.g., seismic events are relatively rare), and we want the algorithm to 
exploit these data while being computationally efficient. However, most kernel based statis¬ 
tics will cost O(n^) to compute based on a sample of n data points. In the change-point 
detection case, this translates to a complexity quadratically grows with the number of back¬ 
ground observations and the detection time horizon t. Ideally, we want to restructure and 
sample the background data during the statistical design to retain statistical efficiency while 
gaining computational efficiency. 
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In this paper, we design two related statistics for change-point detection based on kernel 
maximum mean discrepancy (MMD) for two-sample test (Gretton et ah, 2012; Harchaoui 
et ah, 2013), which we call “M-statistics”. Although MMD has a nice unbiased and min¬ 
imum variance [/-statistic estimator (MMD„), it can not be directly applied since MMD^j 
costs 0{n?) to compute based on a sample of n data points. Therefore, we adopt a strat¬ 
egy inspired by the recently developed i?-test statistic (Zaremba et ah, 2013) and design a 
0{n) statistic for change-point detection. At a high level, our methods sample N blocks of 
background data of size B, compute quadratic-time MMD„ of each reference block with the 
post-change block, and then average the results. However, different from the simple two- 
sample test case, to provide an accurate change-point detection threshold, the background 
block needs to be designed in a novel structured way in the offline setting and updated 
recursively in the online setting. 

Besides presenting the new M-statistics, our contributions also include: (1) deriving 
accurate approximations to the significance level in the offline case, and average run length 
(ARL) in the online case, for our M-statistics, which enable us to determine thresholds 
efficiently without recurring to the onerous simulations (e.g. repeated bootstrapping); (2) 
obtaining a closed-form variance estimator which allows us to form the M-statistic easily; 
(3) developing novel structured ways to design background blocks in the offline setting and 
rules for update in the online setting, which also leads to desired correlation structures of our 
statistics that enable accurate approximations for tail probability; (4) we further improve 
the accuracy of our approximations by including correction terms that take into account the 
skewness of the kernel-based statistics. To approximate the asymptotic tail probabilities, 
we adopt a highly sophisticated technique based on change-of-measure, recently developed 
in a series of paper by Yakir and Siegmund et al. (Yakir, 2013). The numerical accuracy 
of our approximations are validated by numerical examples. We demonstrate the good 
performance of our method using real speech and human activity data. 

Finally, through our study we notice another interesting difference between our change- 
point detection problem and the fixed sample size two-sample test. In two-sample test, it is 
always beneficial to increase the block size B as the distribution for the statistic under the 
null and the alternative will be better separated. However, this is no longer true in online 
change-point detection, because a larger block size inevitably causes a larger detection delay. 

2. Background and Related Work 

We briefly review kernel-based methods and the maximum mean discrepancy. A repro¬ 
ducing kernel Hilbert space (RKHS) F on X with a kernel k{x,x') is a Hilbert space of 
functions /(•) : A i—)• M with inner product Its element k{x,-) satisfies the repro¬ 

ducing property; (/(•), A:(x, •)) = f{x), and consequently, {k{x, ■), k{x', ■))jr = k{x,x'), 
meaning that we can view the evaluation of a function / at any point x G A as an inner 
product. Commonly used RKHS kernel function includes Gaussian radial basis function 
(RBF) kernel k{x,x') = exp(— ||x — x'\\^ /2o'^) where a > 0 is the kernel bandwidth, and 
polynomial kernel k{x,x') = {{x,x') -|- a)*^ where a > 0 and d G N (Scholkopf and Smola, 
2002). RKHS kernels can also be defined for sequences, graph and other structured ob¬ 
ject (Scholkopf et ah, 2004). In this paper, if not otherwise stated, we will assume that 
Gaussian RBF kernel is used. 
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Assume there are two sets with n observations from a domain A, where X = {xi, X 2 , ■ ■ ■, Xn} 
are drawn i.i.d. from distribution P, and Y = {yi,y 2 , ■ ■ ■ ,yn} are drawn i.i.d. from distri¬ 
bution Q. The maximum mean discrepancy (MMD) is defined as (Zaremba et ah, 2013) 


MMDoiP, P,Q] := snp{E,[f{x)]-Ey[f{y)]}. 

/ 6 ^ 


An unbiased estimate of MMDq can be obtained using f7-statistic 


MMD2[^,X,y] = 


1 


n(n — 1) 


^ h{xi,Xj,yi,yj), 


( 1 ) 


where h{-) is the kernel of the [/-statistic defined as 


h{xi,Xj, yi,yj) = k{xi, xj) + k{yi, yj) - k{xi, yj) - k{xj,yi). 

Intuitively, the empirical test statistic MMD^ is expected to be small (close to zero) if 
P = Q, and large if P and Q are far apart. The complexity for evaluating (1) is O(n^) since 
we have to form the so-called Gram matrix for the data. Under Hq {P = Q), the [/-statistic 
is degenerate and distributed the same as an infinite sum of Chi-square variables. 

To improve the computational efficiency and obtain an easy-to-compute threshold for 
hypothesis testing. Recently, an alternative statistic for MMDg has been proposed, called 
the R-test (Zaremba et ah, 2013). The key idea of the approach is to partition the n samples 
from P and Q into N non-overlapping blocks, Xi,..., X^r and Yi,, Yat, each of constant 
size B. Then MMD^[X, Xj, 1)] is computed for each pair of blocks and averaged over the 
N blocks to result in 

1 ^ 

MMD| [X, X,Y] = -Y1 • 

i=l 

Since B is constant, N ~ 0{n), and the computational complexity of MMD^[X,X, Y] is 
0{B‘^n), a significant reduction compared to MMD^[X, X, Y]. Furthermore, by averaging 
MMD^[X, Xj, 1^] over independent blocks, the R-statistic is asymptotically normal lever¬ 
aging over the central limit theorem. This latter property also allows a simple threshold to 
be derived for the two-sample test rather than resorting to more expensive bootstrapping 
approach. Our proposed M-statistics are inspired by the structure of R-statistic. However, 
the change-point detection setting requires significant new derivations to obtain the test 
threshold since one cares about the maximum of MMDg[X, X, Y] computed at different 
point in time. Moreover, the change-point detection case consists of a sum of highly corre¬ 
lated MMD statistics, because these MMD^ are formed with a common test block of data. 
This is inevitable in our change-point detection problems because test data is much less 
than the reference data. Hence, we cannot use the central limit theorem (even a martingale 
version), but have to adopt the aforementioned change-of-measure approach. 


2.1 Related work 

Other nonparametric change-point detection approach has been proposed in the literature. 
In the offline setting, (Harchaoui et ah, 2008) designs a kernel-base test statistic, based on 
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a so-called running maximum partition strategy to test for the presence of a change-point; 
(Zou et ah, 2014) studies a related problem in which there are s anomalous sequences out of n 
sequences to be detected and they construct a test statistic using MMD. In the online setting, 
(Kifer et ah, 2004) presents a meta-algorithm which compares data in some “reference 
window” to the data in the current window, using some empirical distance measures (not 
kernel-based); (Desobry et ah, 2005) detects abrupt changes by comparing two sets of 
descriptors extracted online from the signal at each time instant: the immediate past set 
and the immediate future set, and using a soft margin single-class support vector machine 
(SVM), they build a dissimilarity measure (which is asymptotically equivalent to the Fisher 
ratio in the Gaussian case) in the feature space between those sets without estimating 
densities as an intermediate step; (Song et ah, 2013) uses a density-ratio estimation to 
detect change-points, and models the density-ratio using a non-parametric Gaussian kernel 
model, whose parameters are updated online through stochastic gradient decent. The above 
work lack theoretical analysis for the extremal behavior of the statistics or the false alarm 
rate. 

3. M-statistic 

Given a sequence of observations {..., X- 2 , X-i,xo,xi, ..., xt}, Xi G X, with {..., x_ 2 , X-i,xo} 
denoting the sequence of background (or reference) data. Assume a large amount of refer¬ 
ence data is available. Our goal is to detect the existence of a change-point r, such that 
before the change-point, samples are i.i.d. with a distribution P, and after the change-point, 
samples are i.i.d. with a different distribution Q. The location r where the change-point 
occurs is unknown. We may formulate this problem as a hypothesis test, where the null 
hypothesis states that there is no change-point, and the alternative hypothesis is that there 
exists a change-point at some time r. We will construct our kernel-based M-statistic using 
the maximum mean discrepancy (MMD) to measure the difference between distributions of 
the reference and the test data. 

We denote by Y the block of data which potentially contains a change-point (also referred 
to as the post-change block or test block). In the offline setting, we assume the size of Y can 
be up to Bmax) and we want to search for a location of the change-point B {2 < B < B m^^ ) 
within Y such that observations after B are from a different distribution Q. In the online 
setting, we assume the size of Y is fixed to be Bq and we construct it using a sliding window. 

In this case, the potential change-point is declared as the end of each block Y. Inspired 
by the idea of B-test (Zaremba et ah, 2013), we form N blocks of reference data with sizes 
equal to that of Y, and compute our statistics. 

The reference blocks and the post-change block are constructed differently in the offline 
and online cases. In the offline setting, we truncate data into blocks of size Smax; which is 
equivalent to assuming 0 < r < ihmax- We take {N -|- 1) blocks, treating the last block as 
post-change data and the remaining N blocks as the reference data. Then take B contiguous 
samples out of each block, 2 < B < ihmax (e.g., we may take the right-most side of the block 
which corresponds to the most recent data). Gompute a quadratic MMD^ for samples from 
each reference block with samples from the post-change block, and then average them, as 
illustrated in Fig. 1(a). In the online setting, we truncate data into blocks of a chosen 
size Bq, keep the most recent Bq samples as the post-change block, and use the rest of the 
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Pool of reference data j Block containing 



Block containing 



Figure 1: Illustration of (a) offline case: data are split into blocks of size i?max) indexed backwards 
from time t, and we consider blocks of size B, B = 2^, -Bmax! (b) online case. Assume we have 
large amount of reference or background data that follows the null distribution. 


data as a pool of reference data. Then take NBq samples without replacement (since we 
assume the reference data are i.i.d. with distribution P) from the reference data pool to 
form N reference blocks, compute the quadratic MMD^ statistics between each reference 
block with the post-change block, and then average them. When there is a new sample 
(time moves from t to t + 1), we append the new sample to the post-change block, move the 
oldest sample from the post-change block to the reference pool. The reference blocks are 
also updated accordingly: the end point of each reference block is moved to the reference 
pool, and a new point is sampled and appended to the front of each reference block, as 
shown in Fig. 4(b). 


3.1 Offline M-statistic 

In the offline setting, we sample N reference blocks of size i?max independently from the 
reference pool, and index them as i = 1,... ,N. In searching for a location B 

[2 < B < i?max) within Y for a change-point, we form sub-blocks from each reference block 
by taking B contiguous data points out of that block. Index these sub-blocks as X) . 
Similarly, form sub-blocks from Y by taking B contiguous data points, and denote them as 
Y^^'> (illustrated in Fig. 1(a)). Then compute MMD^ between and average 

over all blocks to form a statistic Zb 


I 


N 


i=l 


N 

E 


B 


NB(B - 1) ^ ^ 


(B) y(B) 
i,l > 




Y{B)^y(B), 


( 2 ) 


where denotes the jth sample in and Yj^^ denotes the jth sample in Y^^\ Due 

to the property of MMD^, under the null hypothesis, E[Zb] = 0. Let Var[Zs] denote the 
variance of Zb under the null. The variance depends on the block size B and the number 
of blocks N, as shown later by Lemma 1. Considering this, we standardize the statistic 
by dividing the standard deviation, and maximize over all possible values of B to form the 
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offline M-statistic. 

M= max -^s/\/Var[Zs], {offline change-point detection} (3) 


where varying the block-size from 2 to .Bmax corresponds to searching for the unknown 
change-point location. A change-point is detected whenever the M-statistic (3) exceeds a 
prescribed threshold b > 0. 


3.2 Online M-statistic 

In the online setting, to simplify later theoretical analysis, as an approximation to the 
change-point location, we set the post-change block size to Bq. Take NBq samples without 
replacement from the pool of reference data to form N reference blocks. This is reasonable 
since the reference data are assumed to be i.i.d. with distribution P. Then compute the 
quadratic statistics MMD^ between each reference block and the post-change block, and 
average them. Using the sliding window scheme described above, we may define an online 
M-statistic by standardizing the average MMD^ between the post-change block in the 
sliding window and the reference blocks: 

1 ^ 

(4) 

i=l 

where Bq is the fixed block-size, is the ith reference block of size Bq at time t, and 

y(-Bo4) ig the post-change block of size Bq at time t. The online change-point detection 
procedure is a stopping time and an alarm is fired whenever the normalized S-statistic (4) 
exceeds a pre-determined threshold b > 0: 


T = infjt : ZBo,t/y^Sii[ZBQ,t\ > {online change-point detection} (5) 

'-V-' 

Mt 

The variance of the depends on the block size Bq but independent of t, and can 

be evaluated efficiently by Lemma 1 in the later section. 

There is a tradeoff in choosing the block size Bq. A small block size has a smaller 
computational cost, which is ideal for online situations. However, a small Bq also leads to 
a lower detection power or a longer detection delay when the change-point is weak. 


3.3 Recursive implementation 

The online M-statistic can be computed recursively via a simple update scheme. By its 
construction, when time elapses from t to a new sample is added into the post-change 

block, and the oldest sample is moved to the reference pool. Each reference block is updated 
similarly by adding one sample randomly drawn from the pool of reference data, and the 
oldest sample is purged. Hence, only a limited number of entries in the Gram matrix that 
are due to the new sample need to be updated. The update scheme is illustrated in Fig. 
2 and explained in more details therein. The recursive scheme reduces the computational 
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cost of the online M-statistic to be linear in time. Similarly, the offline M-statistic can 
also be compnted recursively by utilizing the fact that Zb for B G {2,..., i?max} shares 
many common terms. The recursive scheme reduces the computational cost of the offline 
M-statistic to be 
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Figure 2: Recursive update scheme to compute the online M-statistics. The online M- 
statistic is formed with N background blocks and one testing block and, hence, we keep 
track of N Gram matrices. For illustration purposes, we partition the Gram matrix into 
four windows (in red, black and blue, as shown on the left panel). At time t, to obtain 

we compute the shaded elements and take an average within each 
window. The diagonal entries in each window are removed to obtain an unbiased estimate. 
At time t -|- 1, we update and with the new data point and purge the oldest 

data point, and update the Gram matrix by moving the colored window as shown on the 
right panel, computing the elements within the new windows, and taking an average. Note 
that we only need to compute the right-most column and the bottom row. 


3.4 Analytic expression for Var[Zs] 

We obtain an analytical expression for Var)^^] in (3) and (5), by utilizing the correspon¬ 
dence between the MMD^ statistics and [/-statistic (Serfling, 1980) and known properties of 
[/-statistic. We can also derive the covariance structure for the offline and online Z-statistics 
(Lemma 2), which are used to establish the significance level and the ARL properties. 


Lemma 1 (Variance of Zb under the null) Given a fixed block size B and number of 
blocks N, under the null hypothesis, 


yar[ZB] = 


B 


-1 


N -1 


—E[h^{x, x', y, y')] H-^Cov [h{x, x', y, y'), h{x'', x'”, y, y')] 


N 


, ( 6 ) 


where x, x', x", x'", y, and y' are i.i.d. with the null distribution P. 

Lemma 1 provides a much simpler way to estimate Var)^^] compared to a naive way of using 
the sample variance oi Zb via bootstrapping, which requires a huge amount of samples. For 
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instance, to generate 10000 instances of Zb, we need a total of 10000(A^ -|- 1)B samples, 
since each Zb needs {N + 1)B samples. On the other hand, via Lemma 1, we only need 
to evaluate '&[h?{x,x',y,y')] and Cov[h{x,x',y,y'),h{x'',x''',y,y')\, which requires much 
smaller number of samples. For instance, we may draw four samples without replacement 
from the reference data as x, x' , y, y' , evaluate the sampled function value, and then form 
a Monte Carlo average; the other term Cov[/i(x, x', y, y'),/i(x", x'", y, y')] can be estimated 
in a similar fashion. 

Lemma 1 is quite accurate. We form 10000 instances oi Zb using data sampled from 
the null distribution AA(0,/ 2 o)- Here Ik denotes an identity matrix of size k-hy-k. Figs. 
3(a)-(b) show the empirical distributions of Zb when N = 5, and B = 2 or B = 200, 
respectively. We also plot the Gaussian probability density function with mean equal to 
the sample mean, and the variance predicted by Lemma 1, and it matches well with the 
empirical distribution. Figs. 3(c)-(d) show the Q-Q plot for these two cases. This verifies 
that the empirical distribution oi Zb can be reasonable approximated using its first and 
second moments, which is basis our main theoretical results in Theorem 3 and Theorem 4, 
We also present a method to correct moderate skewness of Zb in Section 6. Figs. 3[|e)- 
(f) show the percentage difference between estimation by Lemma 1 relative to the sample 
variance of Zb- Also, the error decreases with more reference data. Finally, the estimate is 
reasonably accurate even with moderate amount of reference data. 

3.5 Examples of M-statistic 

In this section, we consider a few examples of the M-statistic. 

• Gaussian to Laplace. In Figs. 4(a)-(c), data before the change-point are i.i.d. from 
AA(0,1). After a change-point at r = 250, data are i.i.d. from a Laplace distribution 
with zero mean and unit variance. In this case, the mean and variance before and 
after the change-point are identical and, hence, conventional methods based on mean 
and variance cannot detect the change. The upper panels show the offline and online 
M-statistics in different settings, which show that the M-statistic can detect the 
change-point using the theoretical thresholds obtained from Theorem 3 and Theorem 
4 (shown by the red lines). 

• Gaussian to Gaussian mixture. For Fig. 4(d), the data before the change are 
i.i.d. multivariate Gaussian N'{0,l2), and after the change point at r = 250, are 
i.i.d. from a mixture Gaussians: 0.3J\f{0, 12 ) -|-0.7AA(0, 0 . 1 / 2 )- As shown in Fig. 4(d), 
the online M-statistic detects the change-point quickly using the theoretical threshold. 

• Sequence of graphs. Gonsider a change-point detection problem in the context of 
detecting an emergence of a community in the network (Simonsen and Xie, 2015). 
Assume before the change, each sample is a realization of a Erdos-Renyi random 
graph with the probabilty of forming an edge uniform across the graph. After the 
change, a “community” emerges, which is a subgraph where the edges are formed 
with much higher probability inside. This models a community where the members 
inside interacts more often. In Fig. 4(e), our online M-statistic hits the threshold 
quickly after a community forms. 
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(a): B = 2, N = 5, empirical distribution 



(c); B = 2, N = 5, Q-Q plot 
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(b): B = 200, N = 5, empirical distribution 



(d): B = 200, N = 5, Q-Q plot 
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Figure 3: Accuracy of Lemma 1 in estimating the variance oi when B 


2 and 5 = 200. 


• Real seismic signal. Consider a segment of a real seismic signal. In Fig. 4(f), our online 
M-statistic crosses the theoretical threshold and detects the seismic event quickly. 
Here, we also illustrate the effect of kernel bandwidth. The performance of the M- 
statistic is affected by the kernel bandwidth. For Gaussian RBF kernel kiY,Y') = 
exp (—||T — y^|p/2cr^), the kernel bandwidth cr > 0 is typically chosen by a “median 
trick” (Scholkopf and Smola, 2001), where a is set of the median of the pairwise 
distances between data points. The accuracy of the median trick is justified by a 
recent work (Ramdas et ah, 2015) empirically and theoretically to certain extend. 


4. Theoretical properties under Ho 

The following result characterizes the covariance of online and offline Z-statistics under Hq- 
This allow us to explicitly capture their correlation structures, and it is crucial in deriving 
the significant level in Theorem 3 and the average run length (ARL) in Theorem 4. 
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(a): Offline, null 
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(d): Online, r = 250 
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(c): Online, r = 250 
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(f) Online: seismic signal 


Figure 4: Examples of offline and online M-statistic with N = 5. All thresholds are 
theoretical values and are marked in red. (a) and (b): Offline statistic without and with a 
change-point (i?max = 500, and in (b) the maximum is obtained at B = 263). (c) Online 
statistic with a change-point at r = 250 and we use Bq = 50. The procedure stops at time 
268, which corresponds to a detection delay 18. (d) Online statistic with a change-point at 
r = 250 and we use Bq = 20. The procedure stops at time 270, which corresponds to a 
detection delay 20. (e) Online statistic to detect the emergence of community at r = 100 
and we use Bq = 10. The stopping time T = 102, which corresponds to a detection delay 2. 
(f) A real seismic signal with a change-point corresponds to a seismic event. We illustrate 
A/-statistic with different kernel bandwidth, which all detect the event. 

Lemma 2 (Covariance structure of Z-statistic) Under the null hypothesis, for the of¬ 
fline case, given u,v G [2,ilmax]; 



where uV v = max{u, v}, and for the online case, given s > 0, 

r'u,v = Cov{Mu, Mu+s) = • ( 8 ) 
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In the offline setting, the choice of the threshold b involves a tradeoff between two 
standard performance metrics; (1) the significant level (SL), which is the probability that 
the M-statistic exceeds the threshold b under the null hypothesis (i.e., when there is no 
change); and (2) power, which is the probability of the statistic exceeds the threshold under 
the alternative hypothesis. In the online setting, there are two related performance metrics 
commonly used (Xie and Siegmund, 2013): (1) the average run length (ARL), which is 
the expected time before incorrectly announcing a change of distribution when none has 
occurred; (2) the expected detection delay (EDD), which is the expected time to fire an 
alarm in the extreme case where a change occurs immediately at r = 0. The EDD provides 
an upper bound on the expected delay to detect a change-point when the change occurs 
later in the sequence of observations. In the following, we present accurate approximations 
to the SL and ARL of our methods. These approximations are quite useful in controlling 
the false-alarms. Given a prescribed SL or ARL, we can determine the corresponding 
threshold value b without recurring to the onerous numerical simulations (especially for the 
high-dimensional non-parametric setting). 


4.1 Significant level (SL) approximation 

Let and E°° denote, respectively, the probability measure and expectation under the 
null, i.e., when there is no change-point. 

Theorem 3 (SL in offline case) When 6 —)• oo and b/y/B^^ —)• c for some constant c, 
the significant level of the offline M-statistic defined in (3) is given by 


max 




|^Se{2,3,...,Bn,ax} VVar[ZB] 
where the special function 


>6 


(2B - 1) 


2y/^B{B - 1 ) 


n by 


B=2 


I 2B-1 
B{B-1) 


•[1 + 0 ( 1 )], 

(9) 


(2/ii)(4>(ii/2) - 0.5) 
{u/2)^{u/2) -I- (j){u/2) ’ 


4> is the probability density function and <I>(x) is the cumulative distribution function of the 
standard normal distribution, respectively. 


The derivation of Theorem 3 uses a change-of-measure argument based on the likelihood 
ratio identity (see, e.g., (Siegmund, 1985; Yakir, 2013)), through a series of steps in which 
the large deviation part of the probability is obtained first (derived using the first and 

second order moments of Zb), followed by refinements that result from the identification of 
the contributions due to global and local fluctuations. Note that the large deviation part 
establishes the exponential rate at which the probability converges to zero, but will not 
be accurate enough. The remaining effort of our analysis produces refined approximations, 
including polynomial terms and associated constants. More details for the proof can be 
found in the appendix. 

In a nutshell, the likelihood ratio identity relates computing of the tail probability under 
the null to computing a sum of expectations each under an alternative distribution indexed 
by a particular parameter value. To illustrate, assume the probability density function (pdf) 
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under the null to be f{u). Given a function gu}{x), with a; in some index set D, we may 
introduce a family of alternative distributions with pdf fuj{u) = where 

i’ujid) = log f f{u)du is the log moment generating function, and 9 is the parameter 

that we assign an arbitrary value. It can be easily verified that fto{u) is a pdf. Using this 
family of alternatives, we may calculate the probability of an event A under the original 
distribution /, by calculating a sum of expectations: 


P{A} = E 


r V 1 




where E[?7;yl] = E[?7I{yl}], and the indicator function IjA} is one when event A is true 
and is zero otherwise; E^,, is the expectation using pdf fu}{u); = log[f{u)/fu){u)] = 

(^duiiu) — '^u){9) is the log-likelihood ratio and we have the freedom to choose a different 9 
value for each f^. 

Specihc to our setting, the basic idea of change-of-measure is to treat Z'b in (3) as a 
random field indexed by B. Relate this to above, Z'^ corresponds to guiiv), B corresponds 
to u), and A corresponds to the threshold crossing event. Then to compute the expectations 
under the alternative measures, we first choose a parameter value 9b for each measure 
associated with a parameter value B such that i/’b(0_b) = b. This is equivalent to setting the 
mean under each alternative measure to the threshold b value, i.e., EB[Zg] = b. This allows 
the local central limit theorem to be applied, since under this alternative measure, boundary 
cross event occurs with much higher probability. Second, we express the random quantities 
involved in the expectations as functions of the “local held” {Ib — £s ■ s = B, B ± 1,, 
as well as the re-centered log-likelihood ratios £b — f-B — b. These two quantities are 
asymptotically independent as 6 —)■ oo at a rate of ^fB, which further simplihes calculation. 
The last step is to exploit the covariance structure of the random held (Lemma 2(7)) and 
approximate it via Gaussian random held. This way we explicitly characterize that the 
unavoidable correlation in Z'^ and Z'^ involved in our detection statistic, since they share 
the same post-change block and reference blocks for all i = 1,..., A^. An 

application of the localization theorem (Theorem 5.2 in (Yakir, 2013)) hnalizes the result. 

We demonstrate the accuracy of the approximation in Theorem 3 on synthetic and 
real data. Synthetic data are generated i.i.d. AA(0,/ 20 ) to represent the null distribution. 
The maximum block size i?max = 20, and the number of reference blocks N = 10. First, 
given a prescribed SL value a, we compare the threshold determined by theory and by 
simulation. To obtain threshold by simulation, we run direct Monte Carlo trials to generate 
empirical distributions for offline M-statistic, and find the (1 — a) quantile as the estimated 
threshold. Tables 1 demonstrates that for various choices of i?max, the thresholds predicted 
by our theory (Theorem 3) matches quite well with those obtained from simulation for the 
synthetic data. The accuracy can be further improved for smaller a values, by a correction 
scheme (in Section 6) that utilizes the estimated skewness. 

Furthermore, we consider a real-data example generated using the CENSREC-l-C dataset 
(more details in Section 7). In this case, the null corresponds to the unknown distribution 
of the background speech signal, and we only have 3000 samples of such speech signals. 
Hence, we use bootstrap to generate 10000 re-samples to estimate the empirical distribu¬ 
tion of the detection statistic. This case is more challenging, because the true distribution 
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of the speech data can be arbitrary. Table 2 demonstrates that the thresholds predicted by 
theory is accurate even in this case. Also note that the accuracy improves more significantly 
by skewness correction in this case. 

Table 1; Comparison of thresholds for the offline case using synthetic data, determined by 
simulation, theory (Theorem 3), and Skewness Correction (SC, Equation (15)), respectively, 
for various SL value a. In the SC column, values in the parentheses represent the standard 
deviation based on 100 trials. 


a 

b (sim) 

Bmax = 10 

b (theory) 

b ( SC ) 

b (sim) 

= 20 

b (theory) 

b ( SC ) 

b (sim) 

-^max — 50 
b (theory) 

b ( SC ) 

0.10 

2.29 

2.40 

2.65 (0.10) 

2.47 

2.60 

2.90 (0.12) 

2.70 

2.80 

3.14 (0.17) 

0.05 

2.72 

2.72 

3.02 (0.12) 

2.88 

2.90 

3.25 +.14) 

3.15 

3.08 

3.46 (0.19) 

0.01 

3.74 

3.30 

3.71 (0.16) 

3.68 

3.46 

3.87 (0.16) 

4.08 

3.62 

4.02 (0.19) 


Table 2: Comparison of thresholds for the offline case using real speech data, determined 
by bootstrapping speech samples, theory (Theorem 3), and Skewness Correction (SC, Equa¬ 
tion (15)), respectively, for various SL value a. In the SC column, values in the parentheses 
represent the standard deviation based on 100 trials. 


a 

b (boot) 

B^ax = 10 
b (theory) 

b (SC) 

b (boot) 

Bniax = 20 
b (theory) 

b (SC) 

b (boot) 

Bmax — 50 
6 (theory) 

b (SC) 

0.10 

2.45 

2.40 

3.03 (0.11) 

3.09 

2.60 

3.41 (0.16) 

3.25 

2.80 

3.78 (0.17) 

0.05 

3.17 

2.72 

3.47 (0.12) 

3.84 

2.90 

3.83 (0.18) 

4.01 

3.08 

4.19 (0.19) 

0.01 

4.80 

3.30 

4.32 (0.16) 

5.51 

3.46 

4.67 (0.23) 

5.70 

3.62 

5.01 (0.24) 


4.2 Approximation to average run length (ARL) 

Theorem 4 (ARL in online case) When 6 —)• oo and 6/\/i5o —)■ d for some eonstant d, 
the average run length (ARL) of the stopping time T defined in (5) is given by 


E°°[T] = 


~¥~ 


[ {2Bo - 1 ) 


I \/^Bq{Bq — 1) 



2{2Bo - 1 ) 
Bo{Bo - 1 ) 


•[1 + 0 ( 1 )]. ( 10 ) 


Proof for Theorem 4 is similar to that for Theorem 3, due to the fact that for a given m > 0, 
the probability that the procedure defined in (5) stops before a constant time m > 0 can 
be written as 

P“{T<m}=P“| maxMt>b\. (11) 

( i<t<m j 

Hence, we also need to study the tail probability of the maximum of a random field 
Mt = \J for a fixed block size Bq. A similar change-of-measure approach can 

be used, except that the covariance structure of Mt in the online case ((8) in Lemma 2) is 
different from the offline case ((7) in Lemma 2). The tail probability turns out to be in a 
form of P°°{r < m} = m\ ■ [1 + o(l)]. Using similar arguments as those in (Siegmund and 
Venkatraman, 1995; Siegmund and Yakir, 2008), we may see that T is asymptotically expo¬ 
nentially distributed. Hence, P°°{T < m} — [1 — exp(—Am)] —)■ 0 as m —)■ oo. Consequently 
E°°{T} PS A“^, which leads to (10). 
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Theorem 4 also shows that ART is 0{e^^) and, hence, b is 0{\/log ART). On the 
other hand, EDD is typically on the order of b/A using the Wald’s identity (Siegmund, 
1985), where A is the Kullback-Leibler (KL) divergence between the null and alternative 
distributions (a constant). Hence, given a desired ARL (typically on the order of 5000 or 
10000), the error made in the estimated threshold will only be translated linearly to EDD. 
This is a blessing as it means typically a reasonably accurate b will cause little performance 
loss in EDD. Similarly, Theorem 3 shows that SL is 0{e~^^) and a similar argument can be 
made for the offline case. 

We compare the thresholds obtained by Theorem 4 with that obtained from simulation. 
Consider several cases of null distributions, the standard normal AA(0,1), exponential dis¬ 
tribution with mean 1, a Erdos-Renyi random graph with 10 nodes and probability of 0.2 
of forming random edges, as well as Laplace distribution with zero mean and unit variance. 
The simulation method for determining threshold for a given ARL uses 5000 direct Monte 
Carlo trials. Note that the ARL predicted by Theorem 4 only depends on the number of 
blocks and is independent of the true underlying null distribution. This is verified by our 
simulation in Eig. 5 as the thresholds predicted by Theorem 4 are reasonably accurate for 
all cases of null distributions. 

Fig. 5 also demonstrated that theory is quite accurate for various block sizes (especially 
for larger Bq). However, we also note that theory tends to underestimate the thresholds. 
This is especially pronounced for small Bq, e.g., Bq = 10. The accuracy of the theoretical 
results can be improved by skewness correction, shown by black lines in Fig. 5, and are 
discussed later in Section 6. 

5. Power and EDD under Hi 

In this section, we compare the power of the offline M-statistic and the EDD of the online 
M-statistic with alternative methods when there is a change-point. 

5.1 Offline: comparison with parametric tests 

We compare our offline M-statistics with two commonly used parametric tests, the Hotelling’s 
and the generalized likelihood ratio (GLR). Given a batch of observations {xi,X 2 , ■ ■ ■, Xn} 
with Xi G i.i.d. from a distribution P (note that n corresponds to Rmax in our setting). 
For any possible change-point k, the Hotelling’s statistic is defined as 

T {k) = - - -(xfc-Xfc) S (xfc-Xfc), 

where, Xk = Y^i=i Xi/k, x*^ = Xi/{n - k) and 

( k n 

^(Xi - Xi){Xi - Xif+ ^ {xi - x*){xi - x*Y 
i=l i=k+\ 

The the Hotelling’s detects a change whenever maxi<fc<„ maxr^(A:) exceeds a threshold. 
The GLR statistic is defined as 

(.{k) = nlog|S„| - /clog|Sfc| - (n - /c)log|S^|, 
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Figure 5: For a range of ARL values of the online M-statistic, comparison of thresholds b 
determined from simulation, versus b from Theorem 4 and from the skewness correction (15) 
under various null distributions. Shaded areas represent standard deviations for skewness- 
corrected thresholds. 

where = k~^ (j2i=i{xi - Xi){xi - Xif'^ , and = {n-k)~^ 

The GLR test detects a change whenever maxi<fc<„ l{k) exceeds a threshold. 

In the examples, we use n = Rmax = 200, let the change-point occurs at r = 100, and 
choose the significance level a = 0.05. Thresholds for the offline M-statistic is obtained 
from Theorem 3, and for the other two methods are obtained from simulations. Consider 
the following six cases: 

Case 1 (mean-shift); distribution shifts from J\f{0,l2o) to AA(0.1 • l,/ 2 o); 

Case 2 (mean-shift): distribution shifts from Af{0,l2o) to A/’(0.2 • l,/ 2 o); 

Case 3 (variance-change): distribution shifts from AA(0,/ 20 ) to A/’(0, S), where [S]ii = 2 
and [S]ii = 1, i = 2,... ,20; 
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Table 3: Power, offline, thresholds for all methods are calibrated so that a = 0.05. 



Case 1 

Case 2 

Case 3 

Case 4 

Case 5 

Case 6 

M-statistic 

0.71 

1.00 

0.26 

1.00 

0.37 

0.44 

Hotelling’s 

0.18 

0.88 

0.07 

0.87 

0.19 

0.03 

GLR 

0.03 

0.05 

0.07 

0.12 

0.04 

0.04 


Case 4 (mean and variance change); distribution shifts from AA(1, / 20 ) to AA(0.2 • 1, S), 
where [S]ii = 2 and = 1, i = 2,..., 20; 

Case 5 (model for sparse slope change (Cao et ah, 2015) and the post-change mean 
increases with a constant rate), distribution shifts from Xi i.i.d. AA(0,/ 20 )) f = 1,..., 100, 
to Xi i.i.d. I 20 ), i = 101,..., 150, with [fii]j = 0.02(z — 100), if j £ S, for a set S with 
cardinality |5| =2; 

Case 6 (Gaussian to Laplace): distribution shifts from AA(0,1) to Laplace distribution 
with zero mean and unit variance. In this one-dimensional case we compare against the 
Shewhart control chart (which corresponds to the one-dimensional version of Hotelling’s 

r2). 

Above, 0 denotes a vector of all zeros, 1 denotes a vector of all ones, and denotes 
the iji\i element of a matrix S. 

We evaluate the power for each case using 100 Monte Carlo trials. Table 3 shows that the 
M-statistic has higher power than the Hotelling’s statistic and the GLR statistic in all 
cases. The GLR statistic performs the worst, especially in the high-dimensional instances, 
since it needs to estimate the post-change covariance matrix from a very limited number of 
samples. 

5.2 Online: comparison with Hotelling’s 

We hx ARL for all procedures to be 5000 and the block-size Bq = 20. We compare the 
online M-statistic with a modified Hotelling’s statistic^, which is given by 

T‘^{t) = Bo{xt - jlY%~^{xt - fi), 

where xt = ^ 0 , and p. and S are estimated from reference data. A change- 

point is detected whenever T‘^{t) exceeds a threshold for the first time. The threshold for 
online M-statistic is obtained from Theorem 4, and from simulations for the Hotelling’s 
statistic. To simulate EDD, let the change occur at the hrst point of the testing data. 
Consider the following cases: 

Case 1 (mean shift); distribution shifts from AA(0,/2o) to AA(0.2 • 0,/2o); 

Case 2 (mean shift); distribution shifts from AA(0,/2o) to AA(0.3 • l,/2o); 

Case 3 (covariance change): distribution shifts from AA(0, 120 ) to AA(0, S), where [Ejn = 
2, where i = 1, 2,..., 5 and [S]** = 1, f = 6,..., 20; 

Case 4 (covariance change): distribution shifts from AA(0, 120 ) to AA(0, 2 • / 20 ); 

Case 5 (slope change): similar to the offline case, we randomly choose two dimensions 
with mean increasing at rate 0.01; 

1. Note that we do not compare the online M-statistic with the GLR statistic, since Hotelling’s consis¬ 
tently outperforms GLR in the high-dimensional setting. 
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Table 4: EDD, online, Bq = 20, thresholds for all methods are calibrated so that ARL = 
5000. 



Case 1 

Case 2 

Case 3 

Case 4 

Case 5 

Case 6 

Case 7 

Case 8 

M-statistic 

67.47 

24.20 

29.10 

20.00 

83.00 

49.18 

33.81 

20.00 

Hotelling’s 

77.67 

22.47 

45.46 

21.27 

92.77 

53.76 

- 

- 


Case 6 (slope change): similar to Case 5, except that the mean increasing at rate 0.02; 
Case 7 (Gaussian to Gaussian mixture): distribution shifts from J\f{0,l2o) to mixture 
Gaussian 0.3AA(0, 120 ) + 0.7AA(0, 0.1 • / 20 ); 

Case 8 (Gaussian to Laplace): distribution shifts from AA(0,1) to Laplace distribution 
with zero mean and unit variance. 

We evaluate the EDD for each case using 500 Monte Carlo trials. Note that since 
Bq = 20, the EDD of both methods will be at least 20. The results are summarized in 
Table 4, Note that in detecting changes in either Gaussian mean or covariance, the online 
M-statistic performs competitively with Hotelling’s T^, which is tailored to the Gaussian 
distribution. In the more challenging scenarios such as Gase 7 and Gase 8, the Hotelling’s 
completely fails to detect the change-point whereas the online M-statistic can still detect 
the change fairly quickly. 

5.3 Online: EDD of M-statistic versus block-size Bq 

Lastly, we investigate the dependence of EDD of the M-statistic on the pre-defined block 
size Bq. This example may also shed some light on how to choose Bq in practice. The 
rationale is that, on the one-hand, the detection delay (DD) is greater than Bq, since we 
need at least Bq samples to compute one M-statistic. Hence, a large Bq will artificially 
impose a longer EDD. On the other hand, when block size is too small, we may not pool 
enough post-change samples and the statistical power of the M-statistic is weak, which will 
also result in a large EDD. Hence, there should be an optimal choice for Bq that minimizes 
the EDD. This is validated by our numerical example. Gonsider the distribution shifts from 
AA(0,/2 o) to J\f{fi,l 2 Q), with ^ being element-wise equal to a non-zero constant. In Eigure 
6(a), the shift size of the mean is 0.2, where the minimum EDD is achieved by Bq = 28. 
Eigure 6(b) shows the optimal block sizes for a range of values of the mean shift size (from 
0.1 to 1.2). 

6. Skewness correction 

Although our approximations to SL and ARL using the first and second order moments 
of Z'^ work fairly well, Z'^ does not converge to normal distribution in any sense. Below 
we first illustrate this fact, and then present some improved approximations by taking into 
account the skewness of Z'^. 

6.1 Z'^ does not converge to standard normal 

The distribution of Zb (unnormalized Z'^) can be fairly well approximated by a normal, as 
demonstrated earlier in Eigure 3. However, does not converge to normal distribution 
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(a) (b) 


Figure 6; Online M-statistic for a case where the distribution shifts from AA(0, / 20 ) to 
I 20 ), with being element-wise equal to a non-zero constant, (a) log(EDD) versus 
block size and the optimal block size Bq = 28 corresponds to a minimum FDD; (b) optimal 
block sizes versus the size of the mean shift. 


even when B is large and it has a non-vanishing skewness that we will characterize. Recall 
that Zs is zero-mean. Hence, the skewness of Z'b is related to the variance and third-order 
moment oi Zq via 

k{Z'b) = = Var[ZB]-3/2E[ZB=^]. (12) 

First, we obtain an analytic expression of the third order moment. 

Lemma 5 (Third-order moment of Zb under the null) 

^[^b] = I [hix,x',y,y')h{x',x",y',y")h{x",x,y",y)] 

+ y')Kx', x\ y', y")h{x"', x"\ y", y)] 

+ — — [h{x, x', y, y')h{x", x'", y', y")h{x"", x'"", y", y)] | 
4(1 

+ y^ yTHx", x”', y, y')] 

+ ———^E [h{x, x', y, y')h{x", x'", y, y')h{x"", x'"", y, y')] | . 

Lemma 5 leads to the fact the skewness of Z'^ is non-zero. This is because the third-order 
moment of Zb scales as 0{B~^) (due to (13)), but when dividing via (12) by its variance 
which scales as 0{B~‘^), the skewness becomes a constant with respect to B. Furthermore, 
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examining the Taylor expansion of moment generating function at 0 = 0, we have 
E[e^^B] = 1 + E[Z^] 0 + 4 + o{e^). 


Recall that the moment generating function of a standard normal Z is given by E[e^^] = 
1 + ^ + o{9^). The difference between the two moment generating functions is given by 

E[e^^B] - E[e^^] = ^■^\E[{Z'j^f e^'^' b]\ + o{e^) > 4rc|E[(Z^)3]| + o{e^), (14) 

0 D 

where the inequality is due to the fact that ^b >0 and we may assume it is larger than 
an absolute constant c. From (12), the first term on the right hand side of (14) is given by 
(c0^/6)Var[ZB]~^'^^|E[ZB^]|, which is clearly bounded away from zero. Hence, 


E[e^^h] - (1 + y; 


> ^7 + o( 0 ^) 


for some constant 7 > 0. This shows that the difference between the moment generating 
functions of Z'^ and a standard normal is always non-zero and, hence, Z'^ does not converge 
to a standard normal in any sense. 

Although the result above is discouraging, the difference between the moment generating 
functions of Z'^ and the standard normal distribution is not very large and can be upper 
bounded. By applying a result on Page 220 of (Yakir, 2013), we have 


E[e®^s] - (1 + 



< minj— 
^ 6 




B\ 


e ^ E [\ z ' B \ A }- 


and if considering the skewness k{Z'^) 


E[e' 


ez' 


, 02 03k(Z'3), 

-(l + y + — 




6.2 Skewness correction for significance level and ARL 


By taking into account of the skewness of Zb, we can improve the accuracy of the approxi¬ 
mations for SL in Theorem 3 and for ARL in 4, Recall that when deriving approximations 
using change-of-measurement, we choose parameter 9b such that the moment generating 
function iPb{9b) = b. If Z'^ is assumed to be a standard normal, 'iI>b{9) = 0^/2, and hence 
9b = b. Skewness correction can be achieved by incorporating an additional term for the 
log moment generating function when solving for 9b'- 

^ b ( 0)~0 + E [ Z ) j ^] 02/2 = 6 . 

This will change the leading exponent term in (9) from e“^^/2 be e^s(^s)“^s^. For 
instance, the approximation for SL of the offline M-statistic with the skewness correction 
is given by 


max 




Bn 


R ^J\w:[Zb\ 


>R = E 


B=2 


2V^B{B - 1 ) 


u bx 


2B - 1 
B{B-1)^ 
(15) 


[1 + 0 ( 1 )] 


20 




M-Statistic for Kernel Change-Point Detection 


A similar correction can be done for the ARL approximation in Theorem 4, The skewness 
correction can be important, since it appears in the exponent of the expressions. We found 
that the skewness correction is especially useful when SL is small (e.g. a = 0.01) for the 
offline case or when block size Bq is small (see Table 1, 2 and Fig. 5). 

Another consequence of of Lemma 5 is that we can estimate the skewness k{Zb) ef¬ 
ficiently and avoid the onerous direct sampling of Zb- Lemma 5 and (12) reduce the 
skewness estimation to evaluating eight simpler terms in (13). For instance, to evaluate 
E [h{x, x', y, y')h{x', x”, y', y”)h{x", x, y”, y)], we may use direct Monte Carlo: draw a group 
of six samples samples without replacement from the reference data, treat them as x, x', 
x"^ y, y' and y", evaluate the sampled function value, repeat and then form an average. 

7. Real-data 

We also test the performance of our M-statistics on real data. Our datasets include: (1) 
CENSREC-l-C: a real-world speech dataset in the Speech Resource Consortium (SRC) 
corpora provided by National Institute of Informatics (NII)^; (2) Human Activity Sensing 
Consortium (HASC) challenge 2011 data.^. We compare our M-statistic with a baseline 
algorithm, the relative density-ratio (RDR) estimate (Song et ak, 2013). One limitation of 
the RDR algorithm, however, is that it is not suitable for high-dimensional data because 
estimating density ratio in the high-dimensional setting is an ill-posed problem. To achieve 
reasonable performance for the RDR algorithm, we adjust the bandwidth and the regu¬ 
larization parameter at each time step and, hence, the RDR algorithm is computationally 
more expensive than using M-statistics. We adopt the Area Under Curve (AUC) (Song 
et ak, 2013) (the larger the better) as a performance metric. 

Our M-statistics have very competitive performance compared with the baseline RDR 
algorithm on the real data. Here we report the main results and omit the details due to 
space limit but they can be found in Appendix A, For speech data, our goal is to online 
detect the onset of a speech signal emergent from the background. The backgrounds are 
taken from real acoustic signals, such as noise in highway, airport and subway stations. The 
overall AUC for the M-statistic is 0.8014 and for the baseline algorithm is 0.7578. For 
human activity detection data, our goal is to detect a transition from one activity to another 
as quickly as possible. Each data consists six possible of human activity data collected by 
portable three-axis accelerometers. The overall AUC for the M-statistic is 0.8871 and for 
the baseline algorithm is 0.7161. 

8. Discussions 

There are a few possible directions to extend our M-statistics. (1) Currently, we assume 
that data are i.i.d. from a null distribution P and when the change happens, data are 
i.i.d. from an alternative distribution Q. Under these assumptions, we have developed 
the offline and online change-point detection method based on the kernel two-sample test 
statistic MMD. Our M-statistic can detect the existence of such a change powerfully and 
quickly. Moreover, the M-statistic can also pinpoint the change-point accurately. One may 

2. Available from http://research.nii.ac.jp/src/en/CENSREC-l-C.html 

3. Available from http://hasc.jp/hc2011 
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relax the temporal independence assumption and extend M-statistics for dependent data 
by incorporating ideas from (Chwialkowski and Gretton, 2014). (2) We have demonstrated 
how the number of blocks and block size affect the performance of M-statistics. One 
can also explore how kernel bandwidth as well as the dimensionality of the data would 
affect the performance. An empirical observation is that the performance of MMD statistic 
degrades with the increasing dimensions in data. Some recent results for kernel based test 
can be found in (Ramdas et ah, 2015). We may adopt the idea of (Ramdas et ah, 2015) 
to extend our M-statistics for detecting change in the dependence. (3) Lastly, for really 
high dimensional dataset with large Gram matrix, one can perform random subsampling to 
reduce complexity similar to (Xie et ah, 2015). 

Appendix A. More details for real-data experiments 
A.l CENSREC-l-C Speech dataset 

GENSREC-l-C is a real-world speech dataset in the Speech Resource Consortium (SRC) 
corpora provided by National Institute of Informatics (NII)^. This dataset contains two 
categories of data; (1) Simulated data. The simulated speech data are constructed by 
concatenating several utterances spoken by one speaker. Each concatenated sequence is 
then added with 7 different levels of noise from 8 different environments. So there are totally 
56 different noise. Each noise setting contains 104 sequences from 52 males and 52 females 
speakers. (2) Recording data. The recording data is from two real-noisy environments 
(in university restaurant and in the vicinity of highway), and with two Signal Noise Ratio 
(SNR) settings (lower and higher). Ten subjects were employed for recording, and each one 
has four speech sequence data. 

Experiment Settings. We will compare our algorithm with the baseline algorithm from 
(Song et ah, 2013). (Song et ak, 2013) only utilized 10 sequences from “STREET_SNR_HIGH” 
setting in recording data. Here we will use all the settings in recording data, the SNR level 
20db and clean signals from simulated data. See Figure 7 for some examples of the testing 
data, as well as the statistics computed by our algorithm. For each sequence, we decompose 
it into several segments. Each segment consists of two types of signals (noise vs speech). 
Given the reference data from noise, we want to detect the point where the signal changes 
from noise to speech. 

Evaluation Metrics. We use Area Under Gurve (AUG) to evaluate the computed statis¬ 
tics, like in (Song et ah, 2013). Specifically, for each test sequence that consists of two signal 
distributions, we will mark the points as change-points whose statistics exceed the given 
threshold. If the distance between detected point and true change-point is within the size 
of detection window, then we consider it as True Alarm (True Positive). Otherwise it is a 
False Alarm (False Positive). 

We use 10% of the sequences to tune the parameters of both algorithms, and use the rest 
90% for reporting AUG. The kernel bandwidth is tuned in {O.ldmed) 0.5(iined) cfmed) 2(iined) Sdmed}; 
where dmed is the median of pairwise distances of reference data. Block size is fixed to 50, 
and the number of blocks is simply tuned in {10, 20, 30}. 


4. Available from http://research.nii.ac.jp/src/en/CENSREC-l-C.html 
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Results. Table 5 shows the AUC of two algorithms on different background settings. Our 
algorithm surpasses the baseline on most cases. Both algorithms are performing quite well 
on the simulated clean data, since the difference between speech signals and background 
is more significant than the noisy ones. The averaged AUC of our algorithm on all these 
settings is .8014, compared to .7578 achieved by baseline algorithm. See the ROC curves 
in Figure 8 for a better comparison. 
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Figure 7: Examples of speech dataset. The red vertical bar shown in the upper part of each 
figure is the ground truth of change-point; The green vertical bar shown in the lower part 
is the change-point detected by our algorithm (the point where the statistic exceeds the 
threshold). We also plot the threshold as a red dash horizontal line in each figure. Once 
the statistics touch the threshold, we will stop the detection. 


A.2 HASC human activity dataset 

This data is from Human Activity Sensing Consortium (HASC) challenge 2011^. Each 
data consists of human activity information collected by portable three-axis accelerometers. 
Eollowing the setting in (Song et ah, 2013), we use the ^ 2 -iiorm of 3-dimensional data (i.e., 
the magnitude of acceleration) as the signals. 

We use the ‘RealWorldData’ from HASC Challenge 2011, which consists of 6 kinds of hu¬ 
man activities (walk/jog, stairUp/stairDown, elevatorUp/elevatorDown, escalatorUp/escalatorDown, 
movingWalkway, stay). We make pairs of signal sequences from different activity categories, 
and remove the sequences which are too short. We finally get 381 sequences. We tune the 
parameters using the same way as in CENSREC-l-C experiment. The AUC of our algo- 

5. http://hasc.jp/hc2011 
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Figure 8 : AUC comparison on speech dataset 
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Table 5; AUC results in CENSREC-l-C speech dataset. Simulated data are from 8 
noise categories, and with two different noise levels (clean(C) and SNR 20db (S)); Record¬ 
ing data are from RESTAURANT_SNR_HIGH (RH), RESTAURANT_SNR_LOW (RL), 
STREET_SNR_HIGH (SH) and STREET_SNR_LOW (SL). 


(a) Recording data 



RH 

RL 

SH 

SL 

Ours 

0.7800 

0.7282 

0.6507 

0.6865 

Baseline 

0.7503 

0.6835 

0.4329 

0.6432 

(b) Simulate clean data 



G1 

G2 

G3 

G4 

G5 

G6 

G7 

G8 

Ours 

0.9413 

0.9446 

0.9236 

0.9251 

0.9413 

0.9446 

0.9236 

0.9251 

Baseline 

0.9138 

0.9262 

0.8691 

0.9128 

0.9138 

0.9216 

0.8691 

0.9128 


(c) Simulated data with SNR=20db 



SI 

S2 

S3 

S4 

S5 

S6 

S7 

S8 

Ours 

0.7048 

0.7160 

0.7126 

0.7129 

0.7094 

0.7633 

0.6796 

0.7145 

Baseline 

0.7083 

0.6681 

0.6490 

0.7119 

0.6994 

0.6815 

0.6487 

0.6541 


rithm is .8871, compared to .7161 achieved by baseline algorithm, which greatly improved 
the performance. 

Examples of the signals are shown in Eigure 9, Some sequences are easy to find the 
change-point, like Eigure 9a, and 9d. Some pairs of the signals are hard to distinguish visu¬ 
ally, like Eigure 9b and 9c. The examples show that our algorithm can tell the change-point 
from walk to stairUp/stairDown, or from stairUp/stairDown to escalatorUp/escalatorDown. 
There are some cases when our algorithm raises false alarm. See Eigure 9h. It find a change- 
point during the activity ‘elevatorUp/elevatorDown’. It is reasonable, since this type of 
action contains the phase from acceleration to uniform motion, and the phase from uniform 
motion to acceleration. 


Appendix B. Proofs 

We start with proving Lemma 6 and Lemma 7, which are useful in proving Lemma 1 and 
Lemma 2. 


Lemma 6 (Variance of MMD, under the null.) Under null hypothesis, 

-1 


Var 




¥.[h^{x,x ,y,y)], i = 


(16) 


Proof [Proof of Lemma 6 Eor notational simplicity, we drop the superscript B. Eurther- 
more, under the null hypothesis all data follow the same distribution, we can represent Xi^i 
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figure are the same as in Figure 7, 


and Xij as x and x', and Yi and Yj as y and y', respectively. For any i = 1, 2,..., n, by 
definition of U-statistic, we have 


Var [MMD2(Xi, T)] = Var 


B 


-1 

Y,hiX,^i,X,^„Yi,Yj] 


Kj 


-2 r 


B\ (2 


l) (f- [^xiy[h{x,x',y,y')]] 


( 17 ) 


+1 f ^ f ^ ) Var [h{x, x', y, y')] 


2J \2J \ 2-2 _ 

Under null distribution, Kxfy[h{x,x',y,y')] = 0. Thus, Var [Ea;.y[/i(x, x', y, y')]] = 0, and 
Var [h{x, x', y, y')] = E[h^ix, x, y, y')] - E[/i(x, x, y, y)f = E[h^{x, x', y, y')]. 
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Substitute these results in (17), we obtain the desired result (16). 


Lemma 7 (Covariance of MMD, under the null, same block size.) For s / 0, un¬ 
der null hypothesis 


Cov[h(x„x',y,y'),h(x,+,,x'+,,y,y')] 


Cov 


Proof [Proof of Lemma 7| For notational simplicity, we drop the superscript B. For i = 
1,2,..., N, and s = (1 — i), (2 — i),.. ., {N — i), s / 0, 

Cov [MMD2(Xi,y),MMD2(yi+„y)] 

= Cov I ( Y.h{Xi,i,Xij,Yi,Y^), ^h(Xi+,,p,Xi+s,g,rp,rg: 


KJ 


p<q 


f2\ fB-2 


2/ VI/ V2-1 


Cov \h(yXij^, Xij, y, Yj^, h(yXi^s,pi Xi-\.s^q, Yp, Yq)] 


/ B\ / 2 \ / B _ 2 \ 

”^^ 2 / ( 2 /( 2/(2 2 ) Xij,y,y ), h[Xi^g^p, Xi^g^q,y ,y )j . 

Under null distribution, 

Cov [h(^Xi^i, Xij, y, Yj ), hi^X^^g^p, Xj^^g^q, y, Fq)] 

~ J ) Xij ,y,Yj, XiJ,.g^p, XiJ^g^q, y, Yqjhl^Xj^i, X^j, y, Yj)h(^Xi^g^p, Xi^g^q, y, Yq) 

= j F[X^,y]F[Xi+g^p,y] j F[Xi^j,Yj\h{Xi^i, Xij,y,Yj) J F[Xi+g^q,Yq]h{Xi+g^p, Xi+g^q,y,Yq) 

= 0 . 

Finally, we have: 


B 


-1 


Cov [MMD2(yi,y),MMD2(yi+„y)] = Cov [hiXi^i,Xij,y,y'),hiXi+g,p,Xi+g^q,y,y')] 

Under null hypothesis, Xi^i, Xij, and are independent and they follow the 

same null distribution, so we may replace them with x, x', x”, x"' respectively. Finally 


Cov [MMD\Xi, y; B),MMD\Xi+g, Y-,B)] = (^] Cov [h{x, x', y, y'), h{x", x'", y, y')] . 


B 


-1 
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Proof [Proof for Lemma li For notational simplicity, we drop the superscript B. Using 
results in Lemma 6 and Lemma 7, we have 


Var[ZB] = Var 
1 


1 ^ 

-Y.MMD\X,,Y) 


i=l 


iVVar[MMD2(Xi,y)] + ^Cov [MMD2(Xi, U; 5), MMD2 (Xj, U)] 

i¥=j 




N \ 2 


-1 r 


iV2 ^ \ 2 

i¥=j 

j^E[h‘^{x, x', y, y')] + ^^ ^ Cov [h{x, x', y, y'), h{x", x"', y, y')] 


Proof [Proof of Lemma 2| For the offline case, we have that the correlation 

1 1 


rB,B+v = 


where 


Cov {Zb, Zb+v) = Cov 


^J\ai[ZB] v^Var[ZB+„] 


N 


Cov [Zb, Zb+v] , 




N ^ Vi’ n 

i=l j=l 


+ ^ ^ Cov MMD^ MMD^ , y(-5+^)) 

i¥=j 


Using results from Lemma 6 and Lemma 7, we have; 

i 

2 


Cov {Zb, Zb+v) = 


l(By{B + v)\ „/v 


E[h {x,x ,y,y)] 


+ 


N-l f By {B + v) 


N 


-1 


Cov [h{x, X , y, y ), h{x", x"', y, y) 


By {B + v) 

2 


-1 


—E[h {x,x ,y,y)] 


+ [h{x, x , y, y'), h{x", x", y, y')] 


Finally, plugging in the expressions for Var[yB] and Var[i3 + u], we have (7) for the offline 
case. 
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For the online case we need to analyze r' = Cov (Mt, Mt+s) ■ Without loss of generality, 
assume s > 0. We may use the covariance result above for a fixed block size Bq to obtain 


Cov (mMD^, y), MMD^(xf, y(- 80 , 4 +^) 
jVarlh(x,x',i/,y')], 


B\ fB-s 


(18) 


and 


I Cov(/i(x, X, y, y'), h{x", x”', y, y')). 


B\~‘^ fB-s 


(19) 


Thus, 


Cov {ZBo^t, ZsQ^k+s 


/ TV TV 

Cov — ^MMD2(yf^°’*\y('®0’*)), —^MMD2(xj^°’*+"\y(^0’*+")) 


i=l 


Boy^fBo-s 


i=i 

1 y — 1 

—Var(/i(x, x', y, y')) H- —Cov{h{x, x', y, y'), h{x", x'", y, y')) 


( 20 ) 


We have: 


(BA I 5nM B.-l 


\ 2 ) 


Lemma 8 (Covariance of MMD, under the null, different block sizes, same block index.) 

For blocks with the same index i but with distinct block sizes, under the null hypothesis we 
have 


Cov [MMB‘^{Xi,Y;B),MMD^{Xi,Y;B+ v) 


yviB + v)^^ 


E[h^{x,x,y,y)]. 


( 21 ) 
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Proof [Proof of Lemma 8| Note that 


Cov 

= Cov 




Kj 


p<q 


-1 



B 

B+v 

Y,h{Xi^uXi,j,YuYY 

1 ^ ^ h{Xi^p, Xi^q, Yp, Yq) 


Kj 

p<q 


B\ fB + v\ fBA{B + v) 


V V 2 , 
BV{B + vy 
2 


Va.i[h{x,x',y,y')] 


E[h‘^{x,x,y,y)], 


where the second last equality is due to a similar argument as before to drop block indices 
as they are iid.under the null. ■ 


Lemma 9 (Covariance of MMD, under the null, different block sizes, block indices.) 

Under the null we have 


Cov 


MMD^ (xf, y, MMD^ (Xff, y) 


-1 


BU {B + vY 
2 

Cov [h{x, x', y, y'), h{x”, x'", y, y')] 


Proof [Proof of Lemma 9| Note that 

MMD^ (xf, y, MMD^ (Xff, y) 


Cov 

= Cov 


b\-Wb + v^ 


"f:hixs\xi-\Yr,y; 
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{ B ) y { B ) ^{ B ) ^{ B )-, (B + V 
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-1 B+v 
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Cov [h{x, x', y, y), h{x", x", y, y') 


By {B + v) 


-1 


Cov [h{x, x', y, y), h{x”, x'", y, y')] , 


where the second last equality is due to a similar argument as before to drop block indices 
as they are i.i.d.under the null. 
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Proof [Proof of Theorem 3.] Define Z'^ = ^s/'\/Var[ZB]. We would like to study 

< max Zr > b 
[B£[2,M] 

Recall that is set to the solution to ^b{G) = b and ip Bid) = logE[e^^B] is the log 
moment generating function. Under null hypothesis, we may approximate the distribution 
Z'b AA(0,1). Hence, ipBids) = and the solution 6b to ip{d) = b becomes 

6b = b, and = b‘^/2. 

In the following we will use the “likelihood ratio identity” trick, which computes a 
probability of an event formulated in some distribution by reformulating it as an expectation 
in the context of an alternative distribution (Siegmund, 1985; Yakir, 2013). We use the 
notation Kb[U]A] to indicate that the expectation involves the product between the random 
variable U and the indicator of the event A. Associate with each B, B G [2, Hmax] a log- 
likelhood ratio of the form 



iB = 0 bZ'b- i^B (Ob) = bZ's -b^/2. (22) 

With the aid of such log-likelihood ratios we may produce the likelihood ratio identity: 


max Z'n > 6 ^ = E 

Be[ 2 ,Smax] 


^;pBmax Ib 

—-; max Z'^ > b 

Be[2,Bn,ax] 


= 1 


(23) 


B, 


max r 


B=2 




; max Z'^ > b 


Bn 


Yj 


B=2 


1 


; max Z'^ > b 


■Be[2,Bmax] 


where P_b is the alternative distribution that is associated with the likelihood ratio iB, and 

Eb[U] = 


A local random field is produced by the consideration of difference between the log- 
likelihood ratio at B and the log-likelihood ratios at other parameter values for the block 
size. Using (22), the components of the local field are: 

l,-lB = b{Z'^-Z'j,). (24) 

Our approximation will depend on the summation and maximization statistics of the local 
field: 

Mb = max and Sb = max 

Be[2,Bmax] Be[2,Bmax] 

Also introduce the re-centered log-likelihood ratio: 


Ib = 6BiZ'B - ^(0 b )) = biZ's - b), 
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By introducing and subtracting or dividing terms in (23), we may write it in a form 
that is convenient to apply Theorem 5.2 in (Yakir, 2013): 


Bn 


E 

B=2 




E_b 


Jb max^^i2,B]{Zs-Z'j^} -eB[Z'g-b+ma.x^^l2,Bn,n^]{Zs-Z'g}] 


E 


S&[2,Bn 


^6BZ's—^pB{SB) 


= 


Bn 




B=2 


Z'f> — 6 + max | Z' — Z'r, | > 0 

s£l2,Bn,n^] ^ 


JB 


(25) 


To apply the localization theorem (Theorem 5.2 in (Yakir, 2013)), we need to identify 
the local limit distribution of £b and of the local field {is — is ■ s ^ [2,i3max]} and prove 
asymptotic independence between them. The analysis of the limiting distributions should 
be done under the alternative distribution ¥b- Under the alternative distribution P^, we 
get that Es[£b] = 0, since Eb[^s] = b, and the variance is Vars(^B) = b'^YaiBiiB) = 
6^'0b(0b) = 6^, since '4’b{G) = 0^/2. On the other hand, using a decomposition technique 
similar to that is used for the proof of Lemma 10, the covariance between the local field 
[is-iB] and the re-centered log-likelihood ratio Ib is given by 


Cov (is — 
= -b^(l 


iB,iB) = Eb[ 6(Z( - Z^) • b(Z{, - b)] 
- rs,B) ~ -b^^ 


2I 2(i?-l) 


2B{B - 1) 


B-s 


-b\l - rs,B)^B[Z'B(ZB - b)] 


(26) 


Hence, the asymptotic independence between the local field and the re-centered log-likelihood 
ratio follows from the fact that, when 6 —)• 00 and bj^/B —)• c for some constant c, if \B — s\ 
is small, the covariance between is — iB and t's is on the order of a constant. However, the 
standard deviation oUb diverges to infinity proportional to b. Consequently, the correlation 
between the global term and local fields tends to 0. 

We will approximate the limit joint distribution of the local field and the global term is 
Gaussian. Computation of the expectation and covariance structure are sufficient for obtain 
the final approximation. Lemma 10 shows that the asymptotic distribution of {is — 
ioT s = B + j and \j\ not too large, is a two-sided Gaussian random walk with a negative 
drift. The variance of an increment of this random walk is 

Using the localization theorem (Theorem 5.2 in (Yakir, 2013)), since the local field and 
the re-centered log-likelihood ratio are asymptotically independent when 6 — 00 , we have 


Eb 


^g-fe+iogMs].^'^ + log Mb > 0 

OB 


2 2V2J 


(27) 


Finally, combine the results above we obtain (9). 
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Lemma 10 (Offline, analysis of mean and variance of local field. ) The mean and 
varianee of the local field { is+v — for v = 0, ±1, ±2, .. are related by 

^b[^b+v — ^b] = — ^b\- (28) 

Moreover, given g, = 

„2 

"^BldB+v — (-b] ^—^\v\, VarB[fB+D —-ffi] ~(29) 

Proof [Proof of Lemma 10| From the definition of the local field (10), we have that for 
s = B + v. 

Eb [Ib+v - iB] = Eb [h{Z'B+, - Z'b)] = E \b{Z'B+, - 

^ ^ (30) 

= E [(-6(1 - rB+.,B)Z'B + b^l-rl+,,BW) _ 

The above representation results from the regression of on Zfi: 

^B+v = T'B+v,bZ'b + Y^l — ''^‘b+v,B^i 

with IF being the standardized residual of the regression, and r = Cov [Z'^, . Since 

IF is zero-mean and independent of Z'^, (30) becomes 

Eb[Ib+v - iB\ = -b{l - r)E [z^e^^B-'>V2j = ( 31 ) 

and the last equality follows from the Gaussianity Z'^ r-^ AA(0,1): 

E \zfie^^B-hb^] = = b. (32) 

J v27r y 'fi2TT J 

Similarly, we can compute the variance of the local field under the transformed measure 

\s.XB[iB+v - iB] = Vars [6(Z^+, - Zfi)] = Vare \hrZ'B + b - Zfi 

= YavB \bVl - r^w] + Vars [6(r - 1)Z'b] 

= Eb[62(1 - r^)W‘^] - Eb[6\/i -r2lF]2 

+ Eb [b\r - ifiZfif] - Eb [6(r - l)Zfi] ^ 

= 6^(1 - r^) + b^{r - if = 26^(1 - r). 


Hence, we have the desired result (28). 

Next, using results from Lemma 2, we have that 


rB,B+v 




(33) 
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We will linearize r in terms of small increment v. For n > 0, using Taylor’s expansion 
(1 + u)~^ = 1 — X + o{u)-. 


I BjB-l) 

^ {B + v){B + v-l) 


and for u < 0, 





B{B - 1 

Combine these two cases, we have 


B 


B - 1 


rB,B+v = 


1-m 1-^ 

B \ B-1 


Substitute this in (31), we have that 

6^ 25-1 


^b[£b+v — ^b] = — 




(35) 


+ o(H) = 1-1T|^|„|+o(|„|), (36) 


Lemma 11 (Tail of statistics under the null) When 6 —)■ oo, 

/ 


{T < m} = 


^Bo,t , 

max — , = > 0 


= me 2 


[o<i<m y/Var [ZboA 


B{B - 1)V^ 


[1 + 0 ( 1 )]. 


Proof [Proof for Theorem 4, Let Z'^ := Zbq^iI -y/Var[Z Bo,t\- We start with finding the tail 
probability of the online detection statistic. Note that 


TT < m) = f max M* > 6 ) = J 
\l<t<m J I : 


^BQ,t , 1 

max — , = > b 


i<i<m Y^Var [ZBo,t\ 
Since the block size is fixed to be Bq, using Lemma 1, we have that 


(37) 


Var(Z;) = Var(Z;+J 


Be 


-1 


iV- 1 


—Var(/r(x, x', y, y')) H-—Cov(h(x, x', y, y), h{x\ x”, y, y')) 


Similar to previous analysis, we analyze the local field {4+s “ 41 where 

4 = 6z;-6V2, <+, = - 6V2. 


(38) 
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Use a similar change-of-measure argument, for the sequential problem, we have that (37) 
can be written as 




where the maximum and the sum of the local fields are 


Mt = max e 




. 5<= E 


Jm-it 




(39) 


(40) 


and the re-centered log-likelihood ratio is given by 

it = (.t- b, mt = logMt- (41) 


From Lemma 2, ~ 1 — With similar analysis as for the offline case, we can 

also show that the mean and the variance of the local field terms are 


Et{£t+s-^t} = -b‘^{l-rt+s,t) = b(B -\) Vari{£t+, - |s|. 

'-V-" '-V-" 

/ i 2/2 fj? 

(42) 

And similarly, we may show that the local field terms and the re-centered log-likelihood 
ratio are asymptotically independent. Then using the localization theorem (Theorem 5.2 
in (Yakir, 2013)), we can write (39) as 


{T ^ m} 




E 


b‘^{2B - 1) 


^ ^(^-1) 


= m ■ 


b‘^{2B - 1) 
B{B — 1) 



(43) 


where the last equation is due to the fact that the terms inside the sum are constants that 
are independent of t. Using similar arguments as those in (Siegmund and Venkatraman, 
1995; Siegmund and Yakir, 2008), we may see that T is asymptotically exponentially dis¬ 
tributed and is uniformly integrable. Hence, if A denotes the factor multiplying m on the 
right-hand side of (43), then for still larger m, in the range where mX is bounded away 
from 0 and oo, P°°{T < m} — [1 — exp(—Am)] —)■ 0. Consequently E°°{T} ~ A“^, which is 
equivalent to (10). ■ 


In the following, Lemma 12, Lemma 13, and Lemma 14 are used to derive the final 
expression for the skewness of the statistic: 

Lemma 12 Under null hypothesis, 


E 


{MMlf{Xi,Y)Y 


8{B - 2) 

1)2 


E [h{x, x', y, y')h{x', x", y, y")h{x", x, /, y)] + 


H2(H- 1)2 


E [h{x,x',y,y'f 


35 





Li, Xie, Dai and Song 


Proof [Proof of Lemma 12 


(MMD2(yi,y))^ 

= ( IE 

\2 J 

{y,h{Xi^a:Xi^b,Ya,Yb^ 



\a<b / 


2 J ^ ^ [habhcdhef] ; 


where for simplicity we write hab = h{Xi^a, Xi^b,Ya,yb) and define Ck the corresponding 
number of combination under specific structure. Most of the E [habhcdhef] vanish under the 
null. By enumerating all the combinations, only two terms are nonzero: E [habhbchca\ and 
E . Then, 


E [(MMD2(Xi,y))^ 

8(5 - 2) 


B\ /B 


2(5 — 2)E \habhbchco\ + 


B\ /5 




E [h{Xi^a, Xi,b, Ya, YbMXi^b, Xi,c, n, Y,)hiXi^c, Xi^a, Y^, Ya)] 


52(5-1)2 

+ ... E [h{X,^a,Xi,b,Ya,Ybf 


52(5-1)2 
8(5 - 2) 


52(5-1)2 


E [h{x, x', y, y')h{x', x", y', y")h{x", x, /, y)] + 


52(5-1)2 


E [h{x,x',y,y' f . 


Lemma 13 Under null hypothesis, 




E [ {MMD‘^{Xi, y))^ MMD^{Xj,Y) 

" B^B -1^)2 ^ y'^ y")Hx"', x'”', y”, y)] 


+ 


52(5-1)2 
Proof [Proof of Lemma 13 

E 

L' 

-3 


E [h{x, x', y, y'fh{x'', x'", y, y') . 


5 


(MMD2(yi, y))^ mmd2(Xj, y)J 

h{Xi^a, Xi^b, Ya, Yb)] ( ^ h(y,,a, X,-fe, y„ Yb) 


E 


-I i¥=j 
2 , 


\a<b 


\a<b 


5 


-3 

^ ^ C*fcE \hi^abhi^cdhj^ef \ > 


where for simplicity we write hi^ab = h{Xi^a, Xi^b,Ya,Yb) and define Ck the corresponding 
number of combination under specific structure. Similarly, most of the E [hi^abhi^cdhj^ef] 
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vanish under the null. By enumerating all the combinations, only two terms are nonzero: 
^ [^i,abhi^bchj^ca\ Rnd E [/l^ • Then, 


E 


(MMD2(Xi, Y)f MMD2(Xj, y) 


B\ fB 


-I 

2)E [hi^abhi^bchj^ca] T 


B\~^ fB 


T [^i,abhi^abhj^ab\ 


' B^{B yc)h{Xj^„ Xj^a, Y„ y,)] 


+ 


B2(B- 1)2 
8{B - 2) 


E [h{Xi^a, Xi^b, Ya, YbfhiX.^a, Xj^h, Y^, Yf,) 


.62(5-1)2 

4 


E [h{x, x', y, y')h{x', x", y', y")h{x'", x"", y", y) 


+ 


.62(5- 


y^E [h{x, x', y, y'fh{x'', x'", y, y') . 


Lemma 14 Under null hypothesis, 


E [MMLf (Xi , Y)MMD^ {Xj , y) MMlf {Xr , y)] 

B^{B - 1 ^) 2 ^ ^'''' y'' y”^ y)] 

+ 52 ( 5 ^^, 1 ^ 2 ^ y^ y')Kx'\ x"', y, y')h{x”", x'"", y, y')] . 


Proof [Proof of Lemma 14 1 Note that 


E [MMD2(yi, y)MMD2(Xj, y)MMD2(y^, y)] 




B 


B 


-3 


E 


h{Xi,a, Xi^h, Ya, n) h{Xj,a, Xj^d: Y^, Yd) 


\a<b 


\c<d 


^h{Xr,e,XrJ,Ye,Yf) 
\e<f 


-3 


^ ^ C*/jE \hi^abhj^cdhr,ef) • 
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Similarly, most of the E [hi^abhj^cdhr,ef] vanish under the null. By enumerating all the 
combinations, only two terms are nonzero; E [hi^abhj^bchr,ca] and E [hi^abhj^abhr,ab\ ■ Then, 


E [MMD2(Xj, y)MMD2(Xj, y)MMD2(X^, y)] 


i-¥=j¥=r 

b\ f b\ 

2)E [hi^abhjfichr,ca] ”^( 2 ) ( 2 )^ 


'' B^% n, Y,)h{Xr,c, Xr,a, Ya)] 


+ 


jE[h{Xi^a,X,^h,Ya,Yb)h{X, 

,a; ^j,b^ Ya,Yh)h{Xr,a,Xr,b,Ya,Yh)] 


B2(B- 1)2 

8{B — 2) r f I n m I //,, / //// ///// // N-| 

j^E , 2 /,y )h(x ,x ,y ,y )h{x ,x ,y ,y)\ 


+ 


B 2 (B- 1)2 


E [h{x, x', y, y')h{x", x"', y, y')h{x””, x"'", y, y') . 
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Proof [Proof of Lemma 5| We have 


E[Z|] =E 


' N N 

- j;MMD2(w,y) 


= ^E 

ATS 


2=1 

N 


= ^NE 

ATS 


^MMD2(W,P) ) I ^MMD2(Xj,y) ( ^MMD2(X^,y) 

C=i / \i=l J \r=l / 

1 /3\ (N\ (N - P 


+ 


(MMD 2 (W,y)) 

1 /iV\ fN -l\ fN -2 



^ \ 2 ) V 1 


ATS 


(T) (^ 1 ~ 0 (^ 1 ~ [MMB\Xi,Y)MMD\Xj,Y)MMB\Xr,Y)]^^.^^ 




(MMD 2 (W,P))^ 


3(A^- 1 )^ 


(MMD2(Xi, Y)f MMD2(Xj, y) 


J 


+ [MMD2(W,y)MMD2(X,,y)MMD2(X„y)] 

_ 1 f 8 (B - 2) „ r. . , „ .n 4 


7V2 \ B'^{B-lY 


E [h{x, x', y, y)h{x', x", y, y')h{x", x, /, y)] -f 


^2(5- 1 )' 


rE [h{x,x ,y, 


3{N 1) j 8{B 2) p ^ , i\u( ! n ' "\u( I" nn " V\ 

H-[/i(x,x ,y, 2 /)/i(x ,x ,y )/i(x ,x ,y ,y)\ 


+ - 


B 2 (s_ 1)2 

(iV-l)(jV-2) f 8(i?-2) 

A^r - . ^ 

4 


E [/i(x, x', y, y'fh{x'\ x'", y, y')] | 


+ 


-1 ■B2(^ -1)2 ^ y'^ y")Kx"", x””, /, y)] 

B^{B- 1 ) 2 ® )] I 


^ aIe®' y^ y')Kx', x'\ y', y'')h{x'\ x, y\ y)] 


B2(S- 1)2 [A^2 
3(iV- 1) 


+ 

+ 


iV 2 

(A^- l)(A^-2) 


E [/i(x, x', y, y')h{x', x", y', y")h{x"', x"", y", y)] 

E [/i(x, x, y, y')h{x", x'”, y', y")h{x"", x'"", y", y)] | 


+ 


^ ^^[h{x,x,y,yf] 


B‘^{B - 1)2 [ A ^2 
{N -1){N -2) 


+ - 


iV 2 


E [/i(x, x, y, y')h{x", x'", y, y')h{x'"', x'”', y, y')] | . 
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